Background: Advanced stage (III/IV) diffuse large B-cell lymphoma (DLBCL) at diagnosis is associated with inferior outcomes and often requires intensified therapy. Early identification of high-risk patients could facilitate timely interventions and improved outcomes. Traditional clinical assessments may miss subtle risk factors, while machine learning approaches can capture complex patterns in clinical data. We developed a Random Forest (RF) classifier to predict advanced stage DLBCL at diagnosis, utilizing interpretable machine learning to uncover actionable risk patterns.

Methods: From the SEER database (2000-2020), we analyzed 126,774 DLBCL patients with complete Ann Arbor staging information. Features included demographics (age, sex, race/ethnicity, marital status), tumor characteristics (primary site one-hot encoded for category-specific effects, histology), and diagnostic methods. The dataset was stratified-split 70/30 (train/test) with balanced class weights to address the imbalanced nature of advanced stage (32.6% prevalence). RF was trained using 300 trees, minimum samples per leaf of 10, and 5-fold cross-validation for hyperparameter tuning. Evaluation included AUC, accuracy, sensitivity/specificity, confusion matrix, learning curves, and SHAP for directional feature attribution.

Results: The Random Forest model attained a test AUC of 0.723 and accuracy of 0.678, outperforming traditional prognostic models. Feature importance analysis identified primary site categories (importance 0.313 for site 778, 0.115 for site 770), radiation status (importance 0.115), and chemotherapy receipt (importance 0.026) as dominant predictors. Cross-validation confirmed robust performance (AUC 0.720 ± 0.053). The model achieved sensitivity of 0.668 and specificity of 0.683, with high-risk patients demonstrating 53.1% advanced stage prevalence versus 14.0% in low-risk patients. The model successfully identified site-specific risk patterns and treatment-related factors that influence stage at presentation.

Conclusion: Random Forest offers accurate, interpretable forecasting of advanced DLBCL stage, pinpointing high-risk profiles for enhanced surveillance. The superior performance over traditional approaches highlights the value of machine learning in capturing complex risk patterns. These findings support the development of targeted screening strategies and early intervention protocols, with prospective validation recommended to assess clinical implementation and impact on outcomes.

Clinical Relevance: This study demonstrates that machine learning can enhance early risk assessment in DLBCL by identifying modifiable risk factors and demographic disparities. The identification of site-specific and demographic risk patterns provides opportunities for targeted screening and early intervention. The superior predictive accuracy suggests potential for improved clinical decision-making and resource allocation in routine practice, particularly in resource-limited settings.

This content is only available as a PDF.
Sign in via your Institution